https://courseworks2.columbia.edu/courses/120274/assignments/503952?module_item_id=1036289
Submitted by: Harsh Dhanuka, hd2457
It is important in any data science project to define the objective as specifically as possible. Below, let's write it from the general to the specific. This will direct our analysis.
Section 1: Initial Steps
Section 2: Data Cleaning and Preparation, Feature Engineering
Section 3: EDA of all variables and binning
Section 4: RF Model and SHAP Value Plots
Section 5: Future Considerations, LIME Model, Conclusion
SHAP values (an acronym for SHapley Additive exPlanations) break down a prediction to show the impact of each feature. It is a method for explaining individual predictions, based on the game-theoretically optimal Shapley values. Where could we use this?
Lundberg and Lee (2017) proposed the SHAP value as a unified approach to explaining the output of any machine learning model. Three benefits are worth mentioning here.
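Since SHAP builds on Shapley values, it helps to see the definition at work. Below is a minimal, hypothetical sketch (not part of the assignment's model) that computes exact Shapley values by brute force for a toy two-feature additive model with baseline 0:

```python
from itertools import combinations
from math import factorial

def value(coalition, x):
    # Toy model f(x) = 2*x0 + x1; "absent" features fall back to a baseline of 0.
    x0 = x[0] if 0 in coalition else 0
    x1 = x[1] if 1 in coalition else 0
    return 2 * x0 + x1

def shapley(i, x, n=2):
    # Weighted average of feature i's marginal contribution over all coalitions
    players = set(range(n)) - {i}
    phi = 0.0
    for r in range(n):
        for S in combinations(players, r):
            S = set(S)
            w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi += w * (value(S | {i}, x) - value(S, x))
    return phi

x = (3, 5)
print(shapley(0, x))  # contribution of x0 → 6.0
print(shapley(1, x))  # contribution of x1 → 5.0
```

For an additive model, the Shapley values recover each term's contribution exactly, and they always sum to f(x) minus the baseline prediction; SHAP's TreeExplainer applies the same idea to tree ensembles efficiently.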
SHAP Plot Variable Explanation

# Import all packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import scipy
import time
import seaborn as sns
sns.set(style="whitegrid")
import warnings
warnings.filterwarnings("ignore")
from sklearn.impute import SimpleImputer
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import roc_curve, auc, roc_auc_score, accuracy_score, confusion_matrix
from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_recall_curve
from sklearn.ensemble import RandomForestRegressor
import plotly
import plotly.express as px
from collections import Counter
# Read the data
df = pd.read_csv('/Users/harshdhanuka/Desktop/Columbia Class Matter/SEM 3/5420 Anomaly Detection/Assignment 2 EDA/XYZloan_default_selected_vars.csv')
df.head(2)
print("Number of rows and columns in the dataset:")
df.shape
# Check basic statistics
print("Basic statistics of the columns are as follows:")
df.describe()
AP006
df['AP006'].hist()
df['AP006'].value_counts()
loan_default
# Check the target variable column
print("The number of 0's and 1's are:")
print(df['loan_default'].value_counts())
df['loan_default'].hist()
#df.info()
The columns Unnamed: 0, Unnamed: 0.1, and id are index artifacts. They need to be dropped.
AP005 is a Date-Time column, which cannot be used directly for predictions in the model. Date-Time columns act like an ID column: almost every value is unique, which misrepresents the variable when making predictions, because the field effectively becomes a unique identifier for each record. It is as if you employed the 'id' field in your decision trees. I will derive year, month, day, weekday, etc. from this field. In some models, we may use 'year' as a variable just to explain any special volatility in the past, but we will never use the raw Date-Time field as a predictor.
The following columns also need to be dropped: TD025, TD026, TD027, TD028, CR012, TD029, TD044, TD048, TD051, TD054, TD055, TD061, TD062.
The following categorical columns will be converted to the 'object' type:
AP002: Gender
AP003: Education Code
AP004: Loan Term
AP006: OS Type
AP007: Application City Level
AP008: Flag if City is not the Application City
AP009: Binary format
MB007: Mobile Brands/Type
Convert AP005 to the relevant formats of Year, Month, Day
df['AP005'] = pd.to_datetime(df['AP005'])
# Create 4 new columns
df['Loan_app_day_name'] = df['AP005'].dt.day_name()
df['Loan_app_month'] = df['AP005'].dt.month_name()
df['Loan_app_time'] = df['AP005'].dt.time
df['Loan_app_day'] = df['AP005'].dt.day
# Drop old column
df = df.drop(columns = ['AP005'])
df.head(2)
df["AP002"] = df["AP002"].astype('object')
df["AP003"] = df["AP003"].astype('object')
df["AP004"] = df["AP004"].astype('object')
df["AP006"] = df["AP006"].astype('object')
df["AP007"] = df["AP007"].astype('object')
df["AP008"] = df["AP008"].astype('object')
df["AP009"] = df["AP009"].astype('object')
df["CR015"] = df["CR015"].astype('object')
df["MB007"] = df["MB007"].astype('object')
df['Loan_app_day_name'] = df['Loan_app_day_name'].astype('object')
df['Loan_app_month'] = df['Loan_app_month'].astype('object')
df['Loan_app_time'] = df['Loan_app_time'].astype('object')
df['Loan_app_day'] = df['Loan_app_day'].astype('object')
df = df.drop(columns = ['Unnamed: 0', 'Unnamed: 0.1', 'id', 'TD025', 'TD026', 'TD027', 'TD028', 'CR012','TD029', 'TD044', 'TD048', 'TD051', 'TD054', 'TD055', 'TD061', 'TD062'])
df.head(2)
As per the variable descriptions, all the following columns are either counts, lengths, or days. Hence, negative values such as -999, -99, -98, -1, etc. are all misread NAs and need to be converted back to NaN.
features_nan = ['AP001',
'TD001', 'TD002', 'TD005', 'TD006', 'TD009', 'TD010',
'TD013', 'TD014', 'TD015', 'TD022', 'TD023', 'TD024', 'CR004', 'CR005',
'CR017', 'CR018', 'CR019', 'PA022', 'PA023', 'PA028',
'PA029', 'PA030', 'PA031', 'CD008', 'CD018', 'CD071', 'CD072', 'CD088',
'CD100', 'CD101', 'CD106', 'CD107', 'CD108', 'CD113', 'CD114', 'CD115',
'CD117', 'CD118', 'CD120', 'CD121', 'CD123', 'CD130', 'CD131', 'CD132',
'CD133', 'CD135', 'CD136', 'CD137', 'CD152', 'CD153', 'CD160', 'CD162',
'CD164', 'CD166', 'CD167', 'CD169', 'CD170', 'CD172', 'CD173', 'MB005']
# Define a function to convert negatives to NaN
def convert_to_nan(var):
    # .loc avoids pandas chained assignment, which may silently fail to write
    df.loc[df[var] < 0, var] = np.nan

for i in features_nan:
    convert_to_nan(i)
# Verify that the negatives are gone
print("The minimum now stands at 0 for most of the columns, verifying the mis-represented values are gone.")
df[features_nan].describe()
Multivariate imputer that estimates each feature from all the others. A strategy for imputing missing values by modeling each feature with missing values as a function of other features in a round-robin fashion.
The documentation is here: https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html
from sklearn.experimental import enable_iterative_imputer # noqa
from sklearn.impute import IterativeImputer
df_2 = df[features_nan]
# Verify
df_2.head(3)
imp = IterativeImputer(missing_values=np.nan, sample_posterior=False,
max_iter=10, tol=0.001,
n_nearest_features=None, initial_strategy='median')
imp.fit(df_2)
imputed_data_median = pd.DataFrame(data=imp.transform(df_2),
                                   columns=features_nan,
                                   dtype='int')
imputed_data_median.head(3)
Convert CR009 to a category variable and bin appropriately
df['CR009'] = pd.cut(x=df['CR009'], bins=[-1, 100000, 200000, 300000, 400000, 500000, 600000, 700000, 800000, 900000, 1000000, 1500000])
df = df.astype({'CR009':'object'})
df.CR009.value_counts()
corr = df[['loan_default', 'AP001', 'TD001', 'TD002', 'TD005', 'TD006', 'TD009', 'TD010', 'TD013', 'TD014', 'TD015', 'TD022', 'TD023', 'TD024']].corr()
f,ax = plt.subplots(figsize=(18,12))
sns.heatmap(corr, annot=True, cmap='Greens', linewidths=.4, fmt= '.1f',ax=ax)
plt.show()
# Remove one feature from each pair with a correlation above 0.7
# I will not remove variables here, as I am selecting the important variables based on previous RF and GBM models.
corr_var_drop1 = ['TD005', 'TD022', 'TD006', 'TD009', 'TD013', 'TD023', 'TD010', 'TD014']
# df = df.drop(columns = corr_var_drop1)
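The drop list above could also be derived programmatically by walking the upper triangle of the correlation matrix and keeping one column from each pair that exceeds the 0.7 threshold. A minimal sketch, using a hypothetical helper name `high_corr_to_drop` and synthetic data rather than the loan data:

```python
import numpy as np
import pandas as pd

def high_corr_to_drop(frame, threshold=0.7):
    """Return one column from each pair whose absolute correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair is considered exactly once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Demo with synthetic data: b is a noisy copy of a, c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({'a': a,
                     'b': a + rng.normal(scale=0.1, size=200),
                     'c': rng.normal(size=200)})
print(high_corr_to_drop(demo))  # only 'b' is flagged for dropping
```

Using the upper triangle ensures that for each correlated pair only the later column is flagged, so the first occurrence survives.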
I will keep the other variables, as they are all call detail (CD) data.
filter_col = [col for col in df if col.startswith('CD')]
filter_col.append('loan_default')
corr = df[filter_col].corr()
f,ax = plt.subplots(figsize=(21,21))
sns.heatmap(corr, annot=True, cmap='Greens', linewidths=.4, fmt= '.1f',ax=ax)
plt.show()
# Remove one feature from each pair with a correlation above 0.7
# I will not remove variables here, as I am selecting the important variables based on previous RF and GBM models.
corr_var_drop2 = ['CD173', 'CD172', 'CD170', 'CD169', 'CD167', 'CD166', 'CD164', 'CD162',
'CD137', 'CD136', 'CD135', 'CD133', 'CD132', 'CD131', 'CD117', 'CD118',
'CD120', 'CD121', 'CD123', 'CD114', 'CD113', 'CD108', 'CD107', 'CD106',
'CD101', 'CD072']
# df = df.drop(columns = corr_var_drop2)
df_bin = df.copy(deep = True)
df_bin.head(2)
# Write a function and loop through
def binning(var):
    df_bin[var + '_bin'] = pd.qcut(df_bin[var], 15, duplicates='drop').values.add_categories("NoData")
    df_bin[var + '_bin'] = df_bin[var + '_bin'].fillna("NoData").astype(str)
    df_bin[var + '_bin'].value_counts(dropna=False)
features = ['AP001', # 'AP002', 'AP003', 'AP004', 'AP006', 'AP007',
# 'AP008', 'AP009',
'TD001', 'TD002', 'TD005', 'TD006', 'TD009', 'TD010',
'TD013', 'TD014', 'TD015', 'TD022', 'TD023', 'TD024', 'CR004', 'CR005',
#'CR009', 'CR015',
'CR017', 'CR018', 'CR019', 'PA022', 'PA023', 'PA028',
'PA029', 'PA030', 'PA031', 'CD008', 'CD018', 'CD071', 'CD072', 'CD088',
'CD100', 'CD101', 'CD106', 'CD107', 'CD108', 'CD113', 'CD114', 'CD115',
'CD117', 'CD118', 'CD120', 'CD121', 'CD123', 'CD130', 'CD131', 'CD132',
'CD133', 'CD135', 'CD136', 'CD137', 'CD152', 'CD153', 'CD160', 'CD162',
'CD164', 'CD166', 'CD167', 'CD169', 'CD170', 'CD172', 'CD173', 'MB005'
# 'MB007', 'Loan_app_day_name', 'Loan_app_month', 'Loan_app_time',
# 'Loan_app_day'
]
# categorical variables are commented out
for i in features:
    binning(i)
# View the bins of some variables
print(df_bin['TD001_bin'].value_counts(dropna=False))
print(df_bin['TD022_bin'].value_counts(dropna=False))
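The binning above relies on pd.qcut with duplicates='drop'. On skewed count variables such as the TD fields, repeated quantile edges collapse, so fewer bins than the 15 requested may come back. A small illustration with toy data:

```python
import pandas as pd

# Skewed toy data: a long run of zeros, then a few larger counts
s = pd.Series([0] * 10 + [1, 1, 2, 3, 5, 8, 13, 21, 34, 55])

# Request 4 quantile bins; duplicate edges produced by the zeros are dropped
binned = pd.qcut(s, 4, duplicates='drop')
print(binned.cat.categories.size)  # → 3, not the 4 requested
print(binned.value_counts().sort_index())
```

This is why duplicates='drop' is needed: without it, qcut raises an error when quantile edges are not unique.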
% Y by X (the mean column) for all the numerical columns
The 'mean' column represents the '% Y by X'.
def plot_X_and_Y(var):
    z = df_bin.groupby(var + '_bin')['loan_default'].agg(['count', 'mean']).reset_index()
    z['count_pcnt'] = z['count'] / z['count'].sum()
    x = z[var + '_bin']
    y_mean = z['mean']
    count_pcnt = z['count_pcnt']
    ind = np.arange(0, len(x))
    width = .5
    fig = plt.figure(figsize=(16, 4))
    plt.subplot(121)
    plt.bar(ind, count_pcnt, width, color='r')
    #plt.ylabel('X')
    plt.title(var + ' Distribution')
    plt.xticks(ind, x.tolist(), rotation=45)
    plt.subplot(122)
    plt.bar(ind, y_mean, width, color='b')
    #plt.ylabel('Y by X')
    plt.xticks(ind, x.tolist(), rotation=45)
    plt.tight_layout()
    plt.title('Response mean by ' + var)
    plt.show()
#for i in features:
# plot_X_and_Y(i)
% Y by X (the mean column) for all the categorical columns
The 'mean' column represents the '% Y by X'.
features_2 = ['AP002', 'AP003', 'AP004', 'AP006', 'AP007', 'AP008', 'AP009',
'CR009','CR015', 'MB007', 'Loan_app_day_name', 'Loan_app_month',
'Loan_app_day'
]
def plot_X_and_Y_cat(var):
    z = df_bin.groupby(var)['loan_default'].agg(['count', 'mean']).reset_index()
    z['count_pcnt'] = z['count'] / z['count'].sum()
    x = z[var]
    y_mean = z['mean']
    count_pcnt = z['count_pcnt']
    ind = np.arange(0, len(x))
    width = .5
    fig = plt.figure(figsize=(16, 4))
    plt.subplot(121)
    plt.bar(ind, count_pcnt, width, color='r')
    plt.ylabel('X')
    plt.title(var + ' Distribution')
    plt.xticks(ind, x.tolist(), rotation=45)
    plt.subplot(122)
    plt.bar(ind, y_mean, width, color='b')
    plt.ylabel('Y by X')
    plt.xticks(ind, x.tolist(), rotation=45)
    plt.tight_layout()
    plt.title('Response mean by ' + var)
    plt.show()

for i in features_2:
    plot_X_and_Y_cat(i)
From the above graphs, the following variable seems unimportant, as it shows no pattern, trend, or curve on the '% Y by X' graph:
Loan_app_day_name
df_count = df['AP006'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['AP006 - OS Type','Count']
print(df_count.head())
fig = px.bar(df_count, x = 'AP006 - OS Type', y = 'Count', color = 'AP006 - OS Type',
width=600, height=400,
title = "Distribution of OS type")
fig.show()
df_count = df['AP002'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['AP002 - Gender','Count']
print(df_count.head())
fig = px.bar(df_count, x = 'AP002 - Gender', y = 'Count', color = 'AP002 - Gender',
width=600, height=400,
title = "Distribution of Gender")
fig.show()
df_count = df['AP003'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['AP003 - Education','Count']
print(df_count.head())
fig = px.bar(df_count, x = 'AP003 - Education', y = 'Count', color = 'AP003 - Education',
width=600, height=400,
title = "Distribution of Education")
fig.show()
fig = px.box(df, x="TD001",width=1000, height=500,
title = "Distribution of TD001 - TD_CNT_QUERY_LAST_7Day_P2P")
fig.show()
fig = px.box(df, x="MB005",width=1000, height=500,
title = "Distribution of MB005")
fig.show()
fig = px.box(df, x="AP007", y="TD001",width=900, height=400,
color = "AP002",
title = "The Distribution of Level Application City by TD_CNT_QUERY_LAST_7Day_P2P")
fig.show()
fig = sns.pairplot(df[['AP002', 'AP003', 'AP004']],
hue= 'AP004')
fig
.
.
# Overwrite the NA-value columns with the previously imputed values
df[features_nan] = imputed_data_median
df.head(2)
df.isnull().sum().sum()
# Selecting the top 10 variables based on above two Variable Importance graphs
vars_selected = ['loan_default',
'TD013',
'AP003',
'AP004',
'MB007',
'TD009',
'TD005',
'CR015',
'TD014',
'MB005',
'CD123'
]
df1 = df[vars_selected]
df1.head(2)
df1 = pd.get_dummies(df1,
drop_first = True)
# Drop the first level, to avoid multi-collinearity
df1.head(2)
df1.shape
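drop_first=True trades one dummy column per variable for an implicit reference level, which avoids perfect multicollinearity among the dummies. A tiny sketch with a hypothetical gender column:

```python
import pandas as pd

toy = pd.DataFrame({'AP002': ['M', 'F', 'F', 'M']})  # hypothetical gender codes

full = pd.get_dummies(toy)                      # one column per level
reduced = pd.get_dummies(toy, drop_first=True)  # first level becomes the reference

print(list(full.columns))     # ['AP002_F', 'AP002_M']
print(list(reduced.columns))  # ['AP002_M'] -- 'F' is implied when AP002_M == 0
```

With the reference level dropped, a row of all zeros unambiguously encodes the first category, so no dummy is a linear combination of the others.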
X_var_list = df1.columns.to_list()
X_var_list = X_var_list[1:]
# X_var_list
Y = df1['loan_default']
X = df1[X_var_list]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25)
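The split above is unstratified. Since loan_default is imbalanced (as the value counts earlier showed), one could pass stratify=Y so the 0/1 ratio is identical in both splits; a sketch on synthetic stand-in data, not the loan data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced 0/1 target (~20% positives)
rng = np.random.default_rng(0)
y = pd.Series(rng.binomial(1, 0.2, size=1000))
X = pd.DataFrame({'x': rng.normal(size=1000)})

# stratify=y preserves the class ratio in both the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)
print(round(y.mean(), 2), round(y_tr.mean(), 2), round(y_te.mean(), 2))
```

Without stratification, a small test set can drift away from the population default rate purely by sampling luck.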
model = RandomForestRegressor(
max_depth = 6,
random_state = 0,
n_estimators=10)
model.fit(X_train, Y_train)
print(model.feature_importances_[:10])
importances = model.feature_importances_
indices = np.argsort(importances)
features = X_train.columns
var_imp_chart = pd.DataFrame()
var_imp_chart['Indices'] = indices
var_imp_chart['Variable'] = features
var_imp_chart['Importance'] = importances
var_imp_chart = var_imp_chart.sort_values(by = 'Importance', axis = 0, ascending = False)
print("The top 10 most important variables are: " + '\n')
var_imp_chart.head(10)
importances = model.feature_importances_
indices = np.argsort(importances)
features = X_train.columns
f, ax = plt.subplots(figsize=(12,40))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='g', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
import shap
shap_values = shap.TreeExplainer(model).shap_values(X_train)
# Determine the correlation in order to plot with different colors
corrlist = np.zeros(len(features))
X_train_np = X_train.to_numpy()  # our X_train is a pandas data frame; convert it to numpy
for i in range(0, len(features)):
    tmp = np.corrcoef(shap_values[:, i], X_train_np[:, i])
    corrlist[i] = tmp[0][1]
# plot it
#shap_v_abs = np.abs(shap_values) # Get the absolute values of all SHAP values
#k = pd.DataFrame(shap_v_abs.mean()).reset_index()
#k.columns = ['Variables','abs_SHAP']
#k2 = k.merge()
The shap.summary_plot function with plot_type="bar" lets us produce the variable importance plot.
A variable importance plot lists the most significant variables in descending order. The top variables contribute more to the model than the bottom ones and thus have high predictive power.
shap.summary_plot(shap_values, X_train, plot_type = "bar")
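Under the hood, the bar plot ranks features by the mean absolute SHAP value per column. The same numbers can be computed directly; here a random stand-in matrix is used in place of the shap_values array computed above:

```python
import numpy as np
import pandas as pd

# Random stand-in for an (n_samples, n_features) SHAP value matrix;
# feature 'f1' is given the largest typical effect sizes
rng = np.random.default_rng(0)
sv = rng.normal(scale=[0.5, 2.0, 1.0], size=(100, 3))

# Mean absolute SHAP value per feature -- the quantity the bar plot displays
importance = pd.Series(np.abs(sv).mean(axis=0), index=['f0', 'f1', 'f2'])
print(importance.sort_values(ascending=False))
```

Running the same two lines on the real shap_values and X_train.columns reproduces the bar heights in the plot above.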
df.CR015.value_counts()
The SHAP Value Variable Importance plot need not be the same as the RF Model Variable Importance plot. For our analysis, we will only consider the SHAP Value plot, as it is more valuable and credible.
As we see, the most important variable is TD013 (Count of Queries in Last 6 Months), followed by level 12 of AP004 (Loan Term), which is the 12-year loan term.
The SHAP value summary plot can further show the positive and negative relationships of the predictors with the target variable, and their influence on the predictions.
Y axis: The left side, or the Y axis, shows the different variables, sorted by significance from top to bottom.
Color: Red means a high value of that variable, and blue means a low value.
X axis: The SHAP value, which measures the contribution of that particular variable to the target, i.e. the prediction of loan default.
Points/Dots: Each dot represents one observation in the data (X_train).
shap.summary_plot(shap_values, X_train)
TD013 - This variable, the Count of Queries in the Last 6 Months, has a positive relation with loan default: if its value is high (more red), the loan default value is likely to be high. This means that a high count of queries in the last 6 months makes a loan default more likely.
AP004_12 - This variable, a Loan Term of 12 years (the variable has 4 levels: 3, 6, 9, and 12 years), has a positive relation with loan default: if its value is high (more red), the loan default value is likely to be high. This means that a 12-year loan term makes a loan default more likely.
AP003_3 - This variable, Education Code Level 3 (the variable has 5 levels: 1, 3, 4, 5, 6), has a negative relation with loan default: if its value is high (more red), the loan default value is much less likely to be high. This means that level 3 education makes a loan default less likely.
AP003_4 - This variable, Education Code Level 4 (the variable has 5 levels: 1, 3, 4, 5, 6), has a negative relation with loan default: if its value is high (more red), the loan default value is much less likely to be high. This means that level 4 education makes a loan default less likely.
CR015_6 - This variable, Months of Credit Card MOB Max Level 6 (the variable has 5 levels: 2, 3, 4, 5, 6), has a negative relation with loan default: if its value is high (more red), the loan default value is much less likely to be high. This means that more months of credit card MOB (level 6) makes a loan default less likely.
Similarly, we can interpret the other variables.
A high chance of loan default is associated with the following characteristics, as per the explanations above. In the plot below, red means a positive impact and blue means a negative impact.
def ABS_SHAP(df_shap, df):
    # Make a copy of the input data
    shap_v = pd.DataFrame(df_shap)
    feature_list = df.columns
    shap_v.columns = feature_list
    df_v = df.copy().reset_index().drop('index', axis=1)
    # Determine the correlation in order to plot with different colors
    corr_list = list()
    for i in feature_list:
        b = np.corrcoef(shap_v[i], df_v[i])[1][0]
        corr_list.append(b)
    corr_df = pd.concat([pd.Series(feature_list), pd.Series(corr_list)], axis=1).fillna(0)
    # Make a data frame: Column 1 is the feature, Column 2 is the correlation coefficient
    corr_df.columns = ['Variable', 'Corr']
    corr_df['Sign'] = np.where(corr_df['Corr'] > 0, 'red', 'blue')
    # Plot it
    shap_abs = np.abs(shap_v)
    k = pd.DataFrame(shap_abs.mean()).reset_index()
    k.columns = ['Variable', 'SHAP_abs']
    k2 = k.merge(corr_df, left_on='Variable', right_on='Variable', how='inner')
    k2 = k2.sort_values(by='SHAP_abs', ascending=True)
    colorlist = k2['Sign']
    ax = k2.plot.barh(x='Variable', y='SHAP_abs', color=colorlist, figsize=(8, 40), legend=False)
    ax.set_xlabel("SHAP Value (Red = Positive Impact)")

ABS_SHAP(shap_values, X_train)
To understand how a single feature affects the output of the model, we can plot the SHAP value of that feature versus the value of the feature for all the examples in a dataset. Since SHAP values represent a feature's responsibility for a change in the model output, the plots below represent the change in predicted loan default as the particular variable changes.
To help reveal the interactions, dependence_plot automatically selects another feature for coloring. The SHAP plot function automatically includes another variable that our chosen variable interacts most with.
We may ask how to show a partial dependence plot. The partial dependence plot shows the marginal effect that one or two features have on the predicted outcome of a machine learning model (J. H. Friedman 2001). It tells us whether the relationship between the target and a feature is linear, monotonic, or more complex.
In the dependence plots:
Left Y axis: The SHAP value, i.e. the impact of the variable on the target.
Right Y axis: The variable most highly correlated with the main predictor variable; it is the variable that interacts most with the variable being studied.
X axis: The actual values of the variable itself.
Color: The value of the second variable, high or low; red is a high value, blue is a low value.
TD013 - Count of Queries in Past 6 Months
shap.dependence_plot("TD013", shap_values, X_train)
In this case, we see that:
As TD013, the count of queries in the last 6 months, increases, its impact on loan default increases, making a default more likely; there is a positive relation between the variable and the loan default value. When AP003_4, Education Code Level 4, is low (more blue), the loan default value increases, making a default more likely. The two variables have opposite impacts on the loan default value, and when TD013 is high and AP003_4 is low, a loan default is very likely; there are more blue dots (low values of AP003_4) in the area where TD013 is high. Note, though, that TD013 has a much larger impact on the loan default value than AP003_4.
The plot shows there is an approximately linear and positive trend between “TD013” and the target variable loan default, and “TD013” interacts with “AP003_4” frequently.
A high chance of loan default is associated with a high value of TD013 and a low value of AP003_4, as per the explanations above. Hence, granting such a loan should be considered carefully.
AP003_4 - Education Code 4 (out of 5 levels)
shap.dependence_plot("AP003_4", shap_values, X_train)
In this case, we see that:
AP003_4 is the education code at level 4, which is a binary variable; hence the graph shows only two values, 0 or 1. When TD013, the count of queries in the last 6 months, is high (more red), the impact on loan default is spread out, but the loan default value is generally highest for high values of TD013, making a default more likely. The two variables have different impacts on the loan default value, and low values of AP003_4 coupled with high values of TD013 make a loan default very likely. Note, though, that TD013 has a much larger impact on the loan default value than AP003_4.
The plot shows a binary trend, and "AP003_4" interacts with "TD013" at either 0 or 1.
A high chance of loan default is associated with a high value of TD013 and a low value of AP003_4, as per the explanations above. Hence, granting such a loan should be considered carefully.
MB005 - Years phone is active
shap.dependence_plot("MB005", shap_values, X_train)
In this case, we see that:
As MB005, the number of years the phone has been active, increases, its impact on loan default decreases, making a default less likely; there is a negative relation between the variable and the loan default value. When AP003_3, Education Code Level 3, is high (more red), there is no direct relation to the loan default value; however, in the region where the loan default value is high there are many blue dots, making it a weaker default predictor. The two variables have different impacts on the loan default value, and low values of AP003_3 coupled with low values of MB005 make a loan default very likely. Note, though, that AP003_3 has a much larger impact on the loan default value than MB005.
The plot shows an approximately downward linear trend between "MB005" and the target variable loan default, and "MB005" interacts with "AP003_3" frequently.
A high chance of loan default is associated with a low value of MB005 and a low value of AP003_3, as per the explanations above. Hence, granting such a loan should be considered carefully.
CD123 - Count of Distinct Outbound Calls in past 3 months
shap.dependence_plot("CD123", shap_values, X_train)
In this case, we see that:
As CD123, the count of distinct outbound calls in the past 3 months, increases, its impact on loan default decreases, making a default less likely; there is a negative relation between the variable and the loan default value. However, after a certain point, the relation stabilizes. When TD013, the count of queries in the past 6 months, is high (more red), the loan default value increases, making a default more likely. The two variables have different impacts on the loan default value, and low values of CD123 coupled with high values of TD013 make a loan default very likely. Note, though, that TD013 has a much larger impact on the loan default value than CD123.
The plot shows there is a mixed or hybrid trend between “CD123” and the target variable loan default, and “CD123” interacts with “TD013” frequently.
A high chance of loan default is associated with a low value of CD123 and a high value of TD013, as per the explanations above. Hence, granting such a loan should be considered carefully.
TD009 - Count of queries in the past 3 months
shap.dependence_plot("TD009", shap_values, X_train)
In this case, we see that:
As TD009, the count of queries in the last 3 months, increases, its impact on loan default increases, making a default more likely; there is a positive relation between the variable and the loan default value. When TD013, the count of queries in the last 6 months, is high (more red), the loan default value increases, making a default more likely. Both variables have a positive impact on the loan default value, and when both are high, a loan default is very likely; there are more red dots (high values of TD013) in the area where TD009 is high. Note, though, that TD013 has a much larger impact on the loan default value than TD009.
The plot shows there is an approximately linear and positive trend between “TD009” and the target variable loan default, and “TD009” interacts with “TD013” frequently.
A high chance of loan default is associated with high values of both TD009 and TD013, as per the explanations above. Hence, granting such a loan should be considered carefully.
For each of the top 6 variables, the interaction plot shows the other top variables it interacts with, and their impact on the loan default value.
# Here, I am taking only the first 2,000 rows to save run time
shap_interaction_values = shap.TreeExplainer(model).shap_interaction_values(X_train.iloc[:2000,:])
shap.summary_plot(shap_interaction_values, X_train.iloc[:2000,:])
The force plots help explain the predicted value for each observation.
In order to show how SHAP values work on individual cases, I will execute the plots on several random observations, chosen as shown below:
X_output = X_test.copy()
X_output.loc[:,'predict'] = np.round(model.predict(X_output),2)
random_picks = np.arange(1,4000,500)
S = X_output.iloc[random_picks]
S
# Initialize the jupyter notebook
shap.initjs()
I write a small function shap_plot(j) to execute the SHAP values for the observations.
In the code below, shap.force_plot() takes three values: the base (expected) value, the SHAP values for the chosen observation, and that observation's feature values.
The base value, or expected value, is the average of the model output over the training data X_train. It is the base value used in the following plot.
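The base value can be sanity-checked without shap: for a tree ensemble, TreeExplainer's expected_value is the mean model output over the background data. A sketch with toy stand-in data (the notebook's X_train would be used in practice):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy stand-in for the training data and 0/1 target
rng = np.random.default_rng(0)
Xt = rng.normal(size=(200, 4))
yt = (Xt[:, 0] > 0).astype(float)

rf = RandomForestRegressor(n_estimators=10, max_depth=6, random_state=0).fit(Xt, yt)

# TreeExplainer's expected_value equals the mean prediction over its background
# data, so the base value can be approximated directly:
base_value = rf.predict(Xt).mean()
print(round(base_value, 2))
```

Each force plot then shows how the individual SHAP values push a single prediction away from this average.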
def shap_plot(j):
    explainerModel = shap.TreeExplainer(model)
    shap_values_Model = explainerModel.shap_values(S)
    p = shap.force_plot(explainerModel.expected_value, shap_values_Model[j], S.iloc[[j]])
    return p
means = pd.DataFrame(X_train.mean()).reset_index()
means.columns = ['Variable', 'Mean']
for v in ['TD013', 'CR015_6', 'MB005', 'TD005', 'TD009',
          'AP004_12', 'AP003_3', 'AP003_4', 'CR015_3', 'CD123']:
    print(means[means['Variable'] == v])
    print()
# Mean or the base value of Y
round(Y_test.mean(), 2)
shap_plot(0)
Output value: This is the prediction for that observation, which here is 0.15, lower than the mean
Base value: The base value E (y_hat) is "the value that would be predicted if we did not know any features for the current output." In other words, it is the mean prediction, or mean (y_hat). So the mean prediction of Y_test is 0.19.
Features: The above plot shows the features that contribute to pushing the final prediction away from the base value.
Red/blue colors: Those features that push the prediction higher (to the right) are shown in red, and those pushing the prediction lower are in blue.
Negative (Blue)
TD013: It has a negative impact on the loan default value. The TD013 observation, the Count of Queries in the past 6 months, is 2, which is lower than the average value of 6.81, so it pushes the prediction to the left, away from the base (mean) value. As we saw in the summary plot, a high value of this variable makes a loan default likely; since this value is low, it pushes towards a less likely loan default, to the left.
CR015_6: It has a negative impact on the loan default value. CR015_6, the Months of Credit Card MOB Max Level 6, has a value of 1 here, which is higher than its average of 0.34, so it pushes the predicted loan default value to the left. As we saw in the summary plot, a high value of this variable makes a loan default less likely; since this value is high, it pushes towards a less likely loan default, to the left.
MB005: It has a negative impact on the loan default value. MB005, the Years the Phone is Active, has a value of 11 here, which is higher than its average of 5.97, so it pushes the predicted loan default value to the left. As we saw in the summary plot, a high value of this variable makes a loan default less likely; since this value is high, it pushes towards a less likely loan default, to the left.
Positive (Red)
CD123: It has a positive impact on the loan default value. CD123, the Count of Distinct Outbound Calls in the past 3 months, has a value of 32 here, which is lower than its average of 121.59, so it pushes the predicted loan default value to the right. As we saw in the summary plot, a low value of this variable makes a loan default more likely; since this value is low, it pushes towards a more likely loan default, to the right.
shap_plot(1)
Output value: This is the prediction for that observation, which here is 0.23, higher than the mean
Base value: The base value E (y_hat) is "the value that would be predicted if we did not know any features for the current output." In other words, it is the mean prediction, or mean (y_hat). So the mean prediction of Y_test is 0.19.
Features: The above plot shows the features that contribute to pushing the final prediction away from the base value.
Red/blue colors: Those features that push the prediction higher (to the right) are shown in red, and those pushing the prediction lower are in blue.
Negative (Blue)
AP003_3: It has a negative impact on the loan default value. AP003_3, Education Code Level 3, has a value of 1 here, which is higher than its average of 0.29, so it pushes the predicted loan default value to the left. As we saw in the summary plot, a high value of this variable makes a loan default less likely; since this value is high, it pushes towards a less likely loan default, to the left.
CD123: It has a negative impact on the loan default value. CD123, the Count of Distinct Outbound Calls in the past 3 months, has a value of 274 here, which is higher than its average of 121.59, so it pushes the predicted loan default value to the left. As we saw in the summary plot, a high value of this variable makes a loan default less likely; since this value is high, it pushes towards a less likely loan default, to the left.
Positive (Red)
TD013: It has a positive impact on the predicted loan default value. TD013, the Count of Queries in the past 6 months, is 10 here, higher than its average of 6.81, so it pushes the prediction to the right, above the base value. Consistent with the summary plot, a high value of this variable makes default more likely.
AP004_12: It has a positive impact on the predicted loan default value. AP004_12, a Loan Term of 12, has a value of 1 here, higher than its average of 0.87, so it pushes the prediction to the right. Consistent with the summary plot, a high value of this variable makes default more likely.
shap_plot(2)
Output value: This is the prediction for this observation, which here is 0.15, lower than the mean.
Negative (Blue)
TD013: It has a negative impact on the predicted loan default value. TD013, the Count of Queries in the past 6 months, is 2 here, lower than its average of 6.81, so it pushes the prediction to the left, below the base value. Consistent with the summary plot, a low value of this variable makes default less likely.
CR015_6: It has a negative impact on the predicted loan default value. CR015_6, the Months of Credit Card MOB Max Level 6, has a value of 1 here, higher than its average of 0.34, so it pushes the prediction to the left. Consistent with the summary plot, a high value of this variable makes default less likely.
Positive (Red)
AP004_12: It has a positive impact on the predicted loan default value. AP004_12, a Loan Term of 12, has a value of 1 here, higher than its average of 0.87, so it pushes the prediction to the right. Consistent with the summary plot, a high value of this variable makes default more likely.
CD123: It has a positive impact on the predicted loan default value. CD123, the Count of Distinct Outbound Calls in the past 3 months, has a value of 61 here, lower than its average of 121.59, so it pushes the prediction to the right. Consistent with the summary plot, a low value of this variable makes default more likely.
shap_plot(3)
Output value: This is the prediction for this observation, which here is 0.18, close to the mean.
Negative (Blue)
TD013: It has a negative impact on the predicted loan default value. TD013, the Count of Queries in the past 6 months, is 3 here, lower than its average of 6.81, so it pushes the prediction to the left. Consistent with the summary plot, a low value of this variable makes default less likely.
MB005: It has a negative impact on the predicted loan default value. MB005, the Years Phone is Active, has a value of 6 here, slightly higher than its average of 5.97, so it pushes the prediction to the left. Consistent with the summary plot, a high value of this variable makes default less likely.
Positive (Red)
CD123: It has a positive impact on the predicted loan default value. CD123, the Count of Distinct Outbound Calls in the past 3 months, has a value of 53 here, lower than its average of 121.59, so it pushes the prediction to the right. Consistent with the summary plot, a low value of this variable makes default more likely.
AP004_12: It has a positive impact on the predicted loan default value. AP004_12, a Loan Term of 12, has a value of 1 here, higher than its average of 0.87, so it pushes the prediction to the right. Consistent with the summary plot, a high value of this variable makes default more likely.
shap_plot(4)
Output value: This is the prediction for this observation, which here is 0.30, higher than the mean.
Positive (Red)
TD013: It has a positive impact on the predicted loan default value. TD013, the Count of Queries in the past 6 months, is 7 here, higher than its average of 6.81, so it pushes the prediction to the right. Consistent with the summary plot, a high value of this variable makes default more likely.
AP003_3: It has a positive impact on the predicted loan default value. AP003_3, Education Code Level 3, has a value of 0 here, lower than its average of 0.29, so it pushes the prediction to the right. Consistent with the summary plot, a low value of this variable makes default more likely.
AP004_12: It has a positive impact on the predicted loan default value. AP004_12, a Loan Term of 12, has a value of 1 here, higher than its average of 0.87, so it pushes the prediction to the right. Consistent with the summary plot, a high value of this variable makes default more likely.
AP003_4: It has a positive impact on the predicted loan default value. AP003_4, Education Code Level 4, has a value of 0 here, lower than its average of 0.17, so it pushes the prediction to the right. Consistent with the summary plot, a low value of this variable makes default more likely.
The following H2O explainability plots are shown for illustrative purposes only:
https://docs.h2o.ai/h2o/latest-stable/h2o-docs/explain.html
Heatmap

Correlation Map

Individual Conditional Expectation (ICE)

LIME (Local Interpretable Model-agnostic Explanations) builds sparse linear models around an individual prediction in its local vicinity. Lundberg and Lee (2016) document that LIME is actually a subset of SHAP but lacks some of its properties.
Readers may ask: "If SHAP is already a unified solution, why should we consider LIME?"
The two methods arise from very different approaches. The advantage of LIME is speed: LIME only perturbs data around an individual prediction to build a local model, while SHAP has to compute contributions over all feature permutations to achieve local accuracy. Further, the SHAP Python module does not yet have specially optimized algorithms for all model types (such as KNNs), as documented in "Explain Any Models with the SHAP Values — Use the KernelExplainer", which tests models in KNN, SVM, Random Forest, GBM, and the H2O module.
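To make the speed argument concrete, here is a minimal from-scratch sketch of LIME's core loop: perturb around one instance, weight neighbors by proximity, and fit a weighted linear surrogate to the black-box predictions. The data, kernel width, and perturbation scale are illustrative assumptions, not the loan model:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

# A black-box model on synthetic data
X, y = make_regression(n_samples=300, n_features=5, random_state=0)
black_box = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

rng = np.random.default_rng(0)
x0 = X[0]                                               # the instance to explain
Z = x0 + rng.normal(scale=0.5, size=(500, X.shape[1]))  # local perturbations
weights = np.exp(-np.linalg.norm(Z - x0, axis=1) ** 2)  # proximity kernel
surrogate = Ridge(alpha=1.0).fit(Z, black_box.predict(Z), sample_weight=weights)

# surrogate.coef_ now holds locally faithful feature attributions for x0
```

Because only one neighborhood is sampled per explanation, this runs in a fraction of the time an exact SHAP computation would take.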
# Use all columns except the first as model inputs
X_var_list = df1.columns.to_list()[1:]

Y = df1['loan_default']
X = df1[X_var_list]

# Hold out 25% of the observations for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25)

# Fit a shallow random forest regressor
model = RandomForestRegressor(max_depth=6, random_state=0, n_estimators=10)
model.fit(X_train, Y_train)
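As a quick cross-check on the rankings SHAP produces, the random forest's built-in impurity importances can be inspected directly. The snippet below uses synthetic data as a stand-in; with the loan frame one would pass the fitted model above and index by X.columns:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

Xs, ys = make_regression(n_samples=200, n_features=4, random_state=0)
rf = RandomForestRegressor(n_estimators=10, max_depth=6, random_state=0).fit(Xs, ys)

# Impurity-based importances are normalized to sum to 1
imp = pd.Series(rf.feature_importances_,
                index=[f"f{i}" for i in range(Xs.shape[1])]).sort_values(ascending=False)
```

Unlike SHAP, these importances are global only and can be biased toward high-cardinality features, which is one reason the SHAP summary plot is preferred above.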
import lime
import lime.lime_tabular

X_featurenames = X.columns

# Build a tabular explainer on the training data (regression mode)
explainer = lime.lime_tabular.LimeTabularExplainer(
    np.array(X_train),
    feature_names=X_featurenames,
    class_names=['loan_default'],
    # categorical_features=['AP003', 'AP004', 'MB007', 'CR015'],
    verbose=True,
    mode='regression')

# Explain the first test observation; LIME expects a 1-D numpy array,
# so pass .values rather than the pandas Series
exp = explainer.explain_instance(X_test.iloc[0].values, model.predict)
exp.as_pyplot_figure()

# Tabulate the (feature rule, weight) pairs behind the plot
pd.DataFrame(exp.as_list())
exp.show_in_notebook(show_table=True, show_all=False)
SHAP values are a very important methodology for understanding and explaining which variables drive a given prediction in a random forest model.
Three benefits worth mentioning here: